Abstract

With the advent of drones, aerial video analysis is becoming
increasingly important; yet, it has received scant attention
in the literature. This paper addresses a new problem of
parsing low-resolution aerial videos of large spatial areas,
in terms of 1) grouping, 2) recognizing events and 3) assigning
roles to people engaged in events. We propose a novel
framework aimed at conducting joint inference of the above
tasks, as reasoning about each in isolation typically fails in
our setting. Given noisy tracklets of people and detections
of large objects and scene surfaces (e.g., building, grass),
we use a spatiotemporal AND-OR graph to drive our joint
inference, using Markov Chain Monte Carlo and dynamic
programming. We also introduce a new formalism of
spatiotemporal templates characterizing latent sub-events. For
evaluation, we have collected and released a new aerial
video dataset using a hex-rotor flying over picnic areas
rich with group events. Our results demonstrate that we
successfully address the above inference tasks under
challenging conditions.

1. Introduction 

1.1. Motivation and Objective 

Video surveillance of large spatial areas using unmanned
aerial vehicles (UAVs) is becoming increasingly important in a
wide range of civil, military and homeland security
applications. For example, identifying suspicious human
activities in aerial videos has the potential of saving human lives
and preventing catastrophic events. Yet, there is scant prior
work on aerial video analysis [13, 12, 29], which for the
most part is focused on tracking people and vehicles (with
few exceptions [23]) in relatively sanitized settings.



Figure 1: Our low-resolution aerial videos show top-down
views of people engaged in a number of concurrent events,
under camera motion. Different types of challenges are
color-coded. The red box marks a zoomed-in video part
with varying dynamics among people and their roles
Deliverer and Receiver in Exchange Box. The green box marks
extremely low resolution and shadows. The blue box indicates
an only partially visible Car. The cyan box marks noisy tracking
of a person and the small object Frisbee.

Towards advancing aerial video understanding, this paper
presents a new problem of parsing extremely low-resolution
aerial videos of large spatial areas, such as picnic
areas rich with co-occurring group events, viewed top-down
under camera motion, as illustrated in Fig. 1 and 2. Given
an aerial video, our objectives include:

1. Grouping people based on their events; 

2. Recognizing events present in each group; 

3. Recognizing roles of people involved in these events.

Figure 2: The main steps of our approach. Our recognition accounts for the temporal layout of latent sub-events, people's
roles within events (e.g., Guide, Visitor), and small objects that people interact with (e.g., Box, trash bin). We iteratively
optimize groupings of the foreground trajectories, and infer their events and human roles (color-coded tracks) within events.


1.2. Scope and Challenges 

As illustrated in Fig. 1, we focus on videos of relatively
wide spatial areas (e.g., parks with parking lots) with
interesting terrains, taken on board a UAV flying at an
altitude of 25 m above the ground. People in such videos
form groups engaged in different events, involving
complex n-ary interactions among themselves (e.g., a Guide
leading Tourists in Group Tour), as well as interactions with
objects (e.g., Play Frisbee). Also, people play particular
roles in each event (e.g., Deliverer and Receiver roles in
Exchange Box). This setting poses the following challenges.

1. Low resolution. People and their portable objects are
viewed at an extremely low resolution. Typically, the size of
a person is only 15x15 pixels in a frame, and small objects
critical for distinguishing one event from another may not
even be distinguishable by the human eye.

2. Camera motion makes important cues for event
recognition (e.g., objects like Car) only partially visible or
even out of view, so their reliable detection may require
seeing longer video footage.

3. Shadows in top view make background subtraction
very challenging.

Unfortunately, popular appearance-based approaches to
detecting people and objects, used to produce input for
recognizing group events and interactions [25, 7, 32, 16, 30, 9],
do not handle the above three challenges. Thus we have to
depart from appearance-based event recognition.

In addition, in the face of these challenges, state-of-the-art
methods for people and vehicle tracking frequently
fail to track the moving foreground, and typically produce
short, broken tracklets with a high rate of switched track
IDs.

4. Space-time dynamics. Our events are characterized
by both very large and very small space-time dynamics
within a group of people. For example, in the event of a
line forming in front of a vending machine, called Queue
for Vending Machine, the participants may be initially
scattered across a large spatial area, and may form the line very
slowly, while partially occluding one another when standing
close together in the line.

1.3. Overview of Our Approach 

As Fig. 2 illustrates, our approach consists of two main 
steps: 

1. Preprocessing. We ground our approach onto noisy
detections and tracking. Foreground tracking under camera
motion is made feasible by registering video frames onto
a reference plane. By frame registration, we generate a
panorama for scene labeling. Due to the challenges
mentioned in Sec. 1.2, tracking of small portable objects and
people produces highly unreliable, frequently broken
tracklets, with a high miss rate. We improve the initial tracking
results by agglomeratively clustering tracklets into longer
trajectories based on their spatial layout and velocity. We
detect large objects (e.g., buildings, cars) using the approach
of [31], and classify superpixels [1] of the panorama for
scene labeling.

2. Inference. We seek event occurrences in the space-time
patterns of the foreground trajectories and their
relations with the detections of objects in the scene. To
constrain our recognition hypotheses under uncertainty, we
resort to domain knowledge represented by a probabilistic
grammar, namely a spatiotemporal AND-OR graph
(ST-AOG). ST-AOG encodes decompositions of events into
temporal sequences of sub-events. Sub-events are defined
by our new formalism called latent spatiotemporal
templates of n-ary relations among people and objects. The
templates jointly encode varying spatiotemporal relations of
characteristic roles of all people, as well as their interactions
with objects, while engaged in the event.

We specify an iterative algorithm based on Markov
Chain Monte Carlo (MCMC [15]) along with dynamic
programming (DP) to jointly infer groups, events and human
roles.

Figure 3: A part of ST-AOG for Exchange Box. The nodes are hierarchically connected (solid blue) into three levels, where
the root level corresponds to events, the middle level encodes sub-events, and the leaf level is grounded onto foreground tracklets
and small static objects in the video. The lateral connections (dashed blue) indicate temporal relations of sub-events. The
colored pie-chart nodes represent templates of n-ary spatiotemporal relations among human roles and objects (see Fig. 4).
The magenta edges indicate an inferred parse graph which recognizes and localizes temporal extents of events, sub-events,
human roles and objects in the video.

1.4. Prior Work and Our Contributions 

Our work is related to three research streams. 

Event Recognition in Aerial Videos. Prior work on
aerial image and video understanding typically restricts
its setting to limited tasks. For example, [27]
requires robust motion segmentation and learning of object
shapes for tracking objects; [12] recognizes people based
on background subtraction and motion; and [29] depends
on an appearance-based regressor and background subtraction
for tracking vehicles. Regarding the objectives, these
approaches mainly focus on detecting and tracking people or
vehicles [38, 23, 13]. We advance prior work by relaxing
their assumptions about the setting, and by extending their
objectives to jointly infer groups, events and human roles.

Group Activity Recognition. Simultaneous tracking of
multiple people, discovering groups of people, and
recognizing their collective activities have been addressed only
in every-day videos, rather than aerial videos [8, 32, 17, 10,
18, 7, 6, 5, 34, 36]. Also, work on recognizing group
activities in large spatial scenes requires high-resolution videos
for a "digital zoom-in" [4]. As input, these approaches use
person detections along with cues about human appearance,
pose, and orientation, i.e., information that cannot be
reliably extracted from our aerial videos. There are also some
trajectory-based methods for event recognition [21, 35, 20],
but they focus on simpler events than those we
discuss in this paper. Regarding the representation of collective
activities, prior work has used a descriptor of human
locations and orientations, similar to shape-context [7, 5]. We
advance prior work with our new formalism of latent
spatiotemporal templates of human roles and their interactions
with other actors and objects.

Recognition of Human Roles. Existing work on
recognizing social roles and social interactions of people
typically requires perfect tracking results [30], reliable
estimation of face direction and attention in 3D space [9], or
detection of agents' feet locations in the scene [41], and thus
is not applicable to our domain. Our approach is related
to recent approaches aimed at jointly recognizing events
and social roles by identifying interactions of sub-groups
[10, 18, 16, 14].

Contributions: 

1. Addressing a more challenging setting of aerial videos; 

2. New formalism of latent spatiotemporal templates of 
n-ary relations among human roles and objects; 

3. Efficient inference using dynamic programming aimed
at grouping, recognizing and localizing temporal
extents of events and human roles;

4. New dataset of aerial videos with per-frame
annotations of people's trajectories, object labels, roles,
events and groups.

2. Representation 

2.1. Representing Group Events by ST-AOG

Similar to the hierarchical representations in [11, 19, 24,
26], domain knowledge is formalized as ST-AOG, depicted
in Fig. 3. Its nodes represent the following five sets of
concepts: events $\Delta_E = \{E_i\}$; sub-events $\Delta_L = \{L_a\}$;
human roles $\Delta_R = \{R_j\}$; small objects that people
interact with $\Delta_O = \{O_j\}$; and large objects and scene surfaces
$\Delta_S = \{S_j\}$. A particular pattern of foreground trajectories
observed in a given time interval gives rise to a sub-event,
and a particular sequence of sub-events defines an event.

Edges of the ST-AOG represent decomposition and
temporal relations in the domain. In particular, the nodes are
hierarchically connected by decomposition edges into three
levels, where the root level corresponds to events, the middle
level encodes sub-events, and the leaf level is grounded onto
foreground tracklets and object detections in the video. The
nodes of sub-events are also laterally connected for
capturing "followed-by" temporal relations of sub-events within
the corresponding events.

ST-AOG has special types of nodes. An AND node, $\wedge$,
encodes a temporal sequence of latent sub-events required
to occur in the video so as to enable the event occurrence
(e.g., in order to Exchange Box, the Deliverers first need to
approach the Receivers, give the Box to the Receivers, and
then leave). For a given event, an OR node, $\vee$, serves to
encode alternative space-time patterns of distinct sub-events.

2.2. Sub-events as Latent Spatiotemporal Templates

A temporal segment of foreground trajectories
corresponds to a sub-event. ST-AOG represents a sub-event as
the latent spatiotemporal template of n-ary spatiotemporal
relations among foreground trajectories within a time
interval, as illustrated in Fig. 4. In particular, as an event
unfolds in the video, foreground trajectories form
characteristic space-time patterns, which may not be
semantically meaningful. As they frequently occur in the data,
they can be robustly extracted from training videos through
unsupervised clustering. Our spatiotemporal templates
formalize these patterns within the Bayesian framework using
unary, pairwise, and n-ary relations among the foreground
trajectories. In addition, our unsupervised learning of
spatiotemporal templates addresses structured and unstructured
events in a unified manner. Namely, more structured events
need more templates, and an unstructured one is represented
by a single template.

Unary attributes. A foreground trajectory,
$\Gamma = [\Gamma^1, \ldots, \Gamma^k, \ldots]$, can be viewed as spanning a
number of time intervals, $\tau_k = [t_{k-1}, t_k]$, where
$\Gamma^k = \Gamma(\tau_k)$. Each trajectory segment, $\Gamma^k$, is
associated with unary attributes, $\phi = [r^k, s^k, c^k]$.
Elements of the role indicator vector $r^k(l) = 1$ if $\Gamma^k$
belongs to a person with role $l \in \Delta_R$ or
object class $l \in \Delta_O$; otherwise $r^k(l) = 0$. The speed
indicator $s^k = 1$ when the normalized speed of $\Gamma^k$ is greater
than a threshold (we use 2 pixels/sec); otherwise, $s^k = 0$.
Elements of the closeness indicator vector $c^k(l) = 1$ when
$\Gamma^k$ is within a threshold distance (70 pixels) of any of the
large objects or types of surfaces detected in the scene,
indexed by $l \in \Delta_S$, such as Building or Car; otherwise,
$c^k(l) = 0$.
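
The speed and closeness indicators described above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the function names, the use of mean speed (path length divided by duration) in place of the normalized speed, and the representation of large objects by center points are assumptions made for illustration. Only the two thresholds are taken from the text.

```python
import math

SPEED_THRESHOLD = 2.0        # pixels/sec, threshold quoted in the text
CLOSENESS_THRESHOLD = 70.0   # pixels, threshold quoted in the text


def speed_indicator(points, duration_sec):
    """Return 1 if the mean speed of a trajectory segment exceeds the threshold.

    points: list of (x, y) positions sampled along the segment.
    Assumption: mean speed = path length / duration.
    """
    path_length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return 1 if path_length / duration_sec > SPEED_THRESHOLD else 0


def closeness_indicator(points, object_centers):
    """Return one 0/1 entry per large-object class.

    object_centers: dict mapping a class name to a list of (x, y) centers
    (a hypothetical representation of the object detections).
    """
    indicator = {}
    for name, centers in object_centers.items():
        # Smallest distance between any segment point and any object of this class.
        min_dist = min(
            (math.dist(p, c) for p in points for c in centers),
            default=math.inf,
        )
        indicator[name] = 1 if min_dist <= CLOSENESS_THRESHOLD else 0
    return indicator


segment = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # 10 pixels travelled
objects = {"Building": [(50.0, 8.0)], "Car": [(500.0, 500.0)]}
s = speed_indicator(segment, duration_sec=2.0)    # 5 pixels/sec
c = closeness_indicator(segment, objects)
```

In this toy segment the person moves at 5 pixels/sec and passes within 44 pixels of the Building, so the speed indicator and the Building entry are set, while the Car entry is not.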



Figure 4: Three example templates of n-ary spatiotemporal
relations among foreground trajectories extracted from the
video (XYT-space) for the event Exchange Box. The
recognized roles Deliverers, Receivers and the object Box in each
template are marked cyan, blue and purple, respectively.
Spatiotemporal templates are depicted as colored pie-chart
nodes in Fig. 3.


Pairwise relations of a pair of trajectory segments,
$\Gamma_j^k$ and $\Gamma_{j'}^k$, are aimed at capturing spatiotemporal
relations of human roles or objects represented by the two
trajectories, as illustrated in Fig. 4. The pairwise relations
are specified as
$\phi_{jj'}^k = [d_{jj'}^k, \theta_{jj'}^k, r_{jj'}^k, s_{jj'}^k, c_{jj'}^k]$, where
$d_{jj'}$ is the mean distance between $\Gamma_j^k$ and $\Gamma_{j'}^k$;
$\theta_{jj'}$ is the angle subtended between $\Gamma_j^k$ and $\Gamma_{j'}^k$;
and the remaining three pairwise relations check for compatibility
between the aforementioned binary attributes as
$r_{jj'} = r_j \otimes r_{j'}$, $s_{jj'} = s_j \otimes s_{j'}$,
$c_{jj'} = c_j \otimes c_{j'}$, where $\otimes$ denotes the
Kronecker product.
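
As a small worked example of the compatibility terms, the following Python sketch forms the Kronecker product of two role indicator vectors. The three-class role list and the helper `kron` are illustrative assumptions; the point is only that the product has a single nonzero entry identifying the ordered pair of roles.

```python
def kron(u, v):
    """Kronecker product of two vectors given as Python lists."""
    return [a * b for a in u for b in v]


roles = ["Deliverer", "Receiver", "Box"]   # illustrative subset of classes
r_j = [1, 0, 0]         # segment j belongs to a Deliverer
r_j_prime = [0, 1, 0]   # segment j' belongs to a Receiver

r_pair = kron(r_j, r_j_prime)   # length 3 * 3 = 9

# The nonzero entry sits at index (row * 3 + column) = 0 * 3 + 1.
active = r_pair.index(1)
pair = (roles[active // len(roles)], roles[active % len(roles)])
```

The same construction applies to the scalar speed indicators and to the closeness indicator vectors.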

n-ary relations. Towards encoding unique spatiotemporal
patterns of a set of trajectories, we specify the following
n-ary attribute. A set of trajectory segments,
$G_i(\tau_k) = G_i^k = \{\Gamma_j^k\}$, can be described by an 18-bin
histogram $h^k$ of their velocity vectors. $h^k$ counts orientations
of velocities at every point along the trajectories in a polar
coordinate system: 6 bins span the orientations in $[0, 2\pi]$,
and 3 bins encode the locations of trajectory points relative to
a given center. As the polar-coordinate origin, we use the center
location of a given event in the scene.
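
One way to compute such an 18-bin histogram is sketched below in Python. The 6 orientation bins and 3 location bins follow the text; the radial bin edges (50 and 150 pixels), the bin ordering, and the function name are assumptions, since the text does not specify them.

```python
import math

N_ORIENT, N_RADIAL = 6, 3


def velocity_histogram(trajectories, center, radial_edges=(50.0, 150.0)):
    """18-bin histogram of velocity orientations, binned by distance to center.

    trajectories: list of lists of (x, y) points.
    center: (x, y) origin of the polar coordinate system (event center).
    radial_edges: assumed radii (pixels) separating the 3 location bins.
    """
    hist = [0] * (N_ORIENT * N_RADIAL)
    bin_width = 2 * math.pi / N_ORIENT
    for points in trajectories:
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            vx, vy = x1 - x0, y1 - y0
            if vx == 0 and vy == 0:
                continue  # a stationary point has no velocity orientation
            angle = math.atan2(vy, vx) % (2 * math.pi)
            o_bin = min(int(angle / bin_width), N_ORIENT - 1)
            radius = math.dist((x0, y0), center)
            r_bin = sum(radius >= edge for edge in radial_edges)
            hist[r_bin * N_ORIENT + o_bin] += 1
    return hist


tracks = [
    [(10.0, 0.0), (20.0, 0.0), (30.0, 0.0)],   # moving along +x near the center
    [(0.0, 200.0), (0.0, 190.0)],              # moving along -y far from the center
]
h = velocity_histogram(tracks, center=(0.0, 0.0))
```

Here the first track contributes two counts to the innermost ring and first orientation bin, and the second track contributes one count to the outermost ring.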

Unsupervised Extraction of Templates. We are given training
videos with a ground-truth partition of all their ground-truth
foreground trajectories $G$ into disjoint subsets $G = \{G_i\}$.
Every $G_i$ can be further partitioned into equal-length time
intervals $G_i = \{G_i^k\}$ ($|\tau_k| = 2$ sec). We use K-means
clustering to group all $\{G_i^k\}$, and then estimate spatiotemporal
templates $\{L_a\}$ as representatives of the resulting clusters
$a$. For K-means clustering, we use ground-truth values of
the aforementioned unary and pairwise relations of each $G_i^k$.
In our setting of 11 categories of events occurring in aerial
videos, we estimate $|\Delta_L| = 27$ templates.

3. Formulation and Learning of Templates

Given the spatiotemporal templates, $\Delta_L = \{L_a\}$,
extracted by K-means clustering from training videos (see
Sec. 2.2), we will conduct inference by seeking these latent
templates in foreground trajectories of the new video. To
this end, we define the log-likelihood of a set of foreground
trajectories $G = \{\Gamma_j\}$ given $L_a \in \Delta_L$ as

$$\log p(G \mid L_a) \propto \sum_{j} w_a^1 \cdot \phi_j + \sum_{j,j'} w_a^2 \cdot \phi_{jj'} + w_a^3 \cdot h = w_a \cdot \Big[\sum_{j} \phi_j,\ \sum_{j,j'} \phi_{jj'},\ h\Big] = w_a \cdot \psi, \tag{1}$$

where the right-hand side of (1) formalizes every template
as a set of parameters $w_a = [w_a^1, w_a^2, w_a^3]$ appropriately
weighting the unary, pairwise and n-ary relations of $G$, $\psi$.
Recall that our spatiotemporal templates are extracted from
unit-time segments of foreground trajectories in training.
Thus, the log-likelihood in (1) is defined only for sets $G$
consisting of unit-time trajectory segments.

From (1), the parameters $w_a$ can be learned by
maximizing the log-likelihood of the unit-time trajectory segments
belonging to the corresponding clusters $a$ of training trajectories.

The log-posterior of assigning template $L_a$ to longer
temporal segments of trajectories, falling in $\tau = [t', t]$,
$t' < t$, is specified as

$$\log p(L_a(\tau) \mid G(\tau)) \propto \sum_{k=t'}^{t} \log p(G^k \mid L_a) + \log p(L_a(\tau)), \tag{2}$$

where $p(L_a(\tau))$ is a log-normal prior that $L_a$ can be
assigned to a time interval of length $|\tau|$. The hyper-parameters
of $p(L_a(\tau))$ are estimated using the MLE on training data.
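
The following Python sketch illustrates how the log-posterior of a template over a longer interval can be assembled from unit-time log-likelihoods and a log-normal duration prior. It is a sketch under stated assumptions: the function names are hypothetical, the unit-time log-likelihoods are taken as given numbers, and the MLE shown is the standard one for a log-normal distribution (mean and biased standard deviation of the log-durations).

```python
import math


def log_lognormal_pdf(length, mu, sigma):
    """Log-density of a log-normal distribution evaluated at length > 0."""
    z = (math.log(length) - mu) / sigma
    return -math.log(length * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z


def fit_lognormal_mle(lengths):
    """Standard MLE of (mu, sigma) from observed durations."""
    logs = [math.log(v) for v in lengths]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def template_log_posterior(unit_log_likelihoods, mu, sigma, unit_sec=2.0):
    """Unnormalized log-posterior of one template over consecutive unit segments."""
    duration = unit_sec * len(unit_log_likelihoods)
    return sum(unit_log_likelihoods) + log_lognormal_pdf(duration, mu, sigma)


# Hypothetical training durations (sec) of one template, then a 3-unit segment.
mu, sigma = fit_lognormal_mle([2.0, 4.0, 8.0])
score = template_log_posterior([-1.2, -0.8, -1.0], mu, sigma)
```

The prior penalizes assigning a template to an interval much shorter or longer than the durations observed for it in training.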

4. Probabilistic Model 

A parse graph is an instance of ST-AOG, explaining the
event, sequence of sub-events, and human role and object
label assignment. The solution of our video parsing is a set
of parse graphs, $W = \{pg_i\}$, where every $pg_i$ explains a
subset of foreground trajectories, $G_i \subset G$, as

$$pg_i = \{e_i,\ \tau_i,\ \{L(\tau_{i,u})\},\ \{r_{ij}\}\}, \tag{3}$$

where $e_i \in \Delta_E$ is the recognized event conducted by $G_i$;
$\tau_i = [t_{i,0}, t_{i,T}]$ is the temporal extent of $e_i$ in the video
starting from frame $t_{i,0}$ and ending at frame $t_{i,T}$; $L(\tau_{i,u})$
are the templates (i.e., latent sub-events) assigned to
non-overlapping, consecutive time intervals $\tau_{i,u} \subset \tau_i$, such that
$|\tau_i| = \sum_u |\tau_{i,u}|$; and $r_{ij}$ is the human role or object class
assignment to the $j$th trajectory $\Gamma_{ij}$ of $G_i$.

Our objective is to infer $W$ that maximizes the
log-posterior $\log p(W \mid G) \propto -\mathcal{E}(W \mid G)$, given all foreground
trajectories $G$ extracted from the video. The corresponding
energy $\mathcal{E}(W \mid G)$ is specified for a given partitioning of $G$
into $N$ disjoint subsets $G_i$ as

$$\mathcal{E}(W \mid G) \propto \sum_{i=1}^{N} \Big[ \underbrace{-\log p(\wedge_{e_i} \mid \vee_{\mathrm{root}})}_{\text{select event}} + \sum_{u} \Big( \underbrace{-\log p(\wedge_{L_a} \mid \vee_{e_i})}_{\text{select template } L_a} \underbrace{-\log p(L_a(\tau_{i,u}) \mid G_i(\tau_{i,u}))}_{\text{assign template}} \Big) \Big], \tag{4}$$

where $G_i(\tau_{i,u})$ denotes temporal segments of foreground
trajectories falling in time intervals $\tau_{i,u}$,
$|\tau_i| = \sum_u |\tau_{i,u}|$, and
$\log p(L_a(\tau_{i,u}) \mid G_i(\tau_{i,u}))$ is given by (2).
Also, $\log p(\wedge_{e_i} \mid \vee_{\mathrm{root}})$ and
$\log p(\wedge_{L_a} \mid \vee_{e_i})$ are the log-probabilities
of the corresponding switching OR nodes in
ST-AOG for selecting particular events $e_i \in \Delta_E$ and
spatiotemporal templates $L_a \in \Delta_L$. These two switching
probabilities are simply estimated as the frequency of
corresponding selections observed in training data.

5. Inference 

Given an aerial video, we first build a video panorama
and extract foreground trajectories $G$. Then, the goal of
inference is to: (1) partition $G$ into disjoint groups of
trajectories $\{G_i\}$ and assign an event label $e_i \in \Delta_E$ to every $G_i$, (2)
assign human roles and object labels $r_{ij}$ to trajectories $\Gamma_{ij}$
within each group $G_i$, and (3) assign latent spatiotemporal
templates $L(\tau_{i,u}) \in \Delta_L$ to temporal segments $\tau_{i,u}$ of
foreground trajectories within every $G_i$. For steps (1) and (2) we
use two distinct MCMC processes. Given groups $G_i$, event
labels $e_i$ and role assignments $r_{ij}$ proposed in (1) and (2),
step (3) uses dynamic programming for efficient estimation
of sub-events $L(\tau)$ and their temporal extents $\tau$. Steps (1)-(3)
are iterated until convergence, i.e., when $\mathcal{E}(W \mid G)$, given
by (4), stops decreasing after a sufficiently large number of
iterations.

5.1. Grouping 

Given $G$, we first use [10] to perform initial clustering
of foreground trajectories into atomic groups. Then,
we apply the first MCMC to iteratively propose either to
merge two smaller groups into a merger, with probability
$p(1) = 0.7$, or to split a merger into two smaller groups,
with probability $p(2) = 0.3$. Given the proposal, each
resulting group $G_i$ is labeled with an event $e_i \in \Delta_E$ (we
enumerate all possible labels). In each proposal, the MCMC
jumps from the current solution $W$ to a new solution $W'$
generated by one of the dynamics. The acceptance rate is
$\alpha = \min\Big\{1, \frac{Q(W' \to W)\, p(W' \mid G)}{Q(W \to W')\, p(W \mid G)}\Big\}$,
where the proposal distribution $Q(W \to W')$ is one of $p(1)$ or
$p(2)$ depending on the proposal, and $p(W \mid G)$ is given by (4).
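
A minimal Python sketch of one merge/split acceptance test is given below. It assumes that the posterior is proportional to the exponential of the negative energy, and that the reverse of a merge is a split (and vice versa), so the proposal ratio reduces to the two probabilities quoted above; the probability of choosing which particular groups to merge or split is ignored. The function names are illustrative and not part of our implementation.

```python
import math
import random

P_MERGE, P_SPLIT = 0.7, 0.3   # proposal probabilities quoted in the text


def acceptance_rate(energy_cur, energy_new, q_forward, q_reverse):
    """Metropolis-Hastings acceptance rate with p(W|G) proportional to exp(-energy)."""
    log_ratio = (energy_cur - energy_new) + math.log(q_reverse) - math.log(q_forward)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_step(energy_cur, energy_new, move, rng=random):
    """Accept or reject a proposed 'merge' or 'split' move.

    Assumption: the inverse of a merge is a split, so the reverse proposal
    probability is simply the probability of the opposite move type.
    """
    q_forward = P_MERGE if move == "merge" else P_SPLIT
    q_reverse = P_SPLIT if move == "merge" else P_MERGE
    alpha = acceptance_rate(energy_cur, energy_new, q_forward, q_reverse)
    return rng.random() < alpha, alpha


rng = random.Random(0)
accepted, alpha = mh_step(energy_cur=12.0, energy_new=10.5, move="split", rng=rng)
```

A proposal that lowers the energy by more than the proposal ratio can offset is always accepted; otherwise it is accepted with probability equal to the acceptance rate.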
Figure 5: Our DP process can be illustrated by this DAG
(directed acyclic graph). An edge between $L_{a'}^{k'}$ and $L_a^k$ means
that the transition $L_{a'} \to L_a$ follows the rule defined in ST-AOG
and the time interval $[t_{k'}, t_k]$ is assigned template $L_a$.
In this sense, with the transition rules and the prior defined
in (2) (we do not consider assignments with low prior
probability), we can define the edges of such a DAG. So the
goal of DP is equivalent to finding a shortest path between
source and sink. The red edges highlight a possible path.
Suppose we find a path source $\to L_3 \to L_1 \to$ sink.
This means that we decompose $[0, T]$ into 2 time intervals,
$[0, 3\delta t]$ and $[3\delta t, T]$, which are assigned templates $L_3$
and $L_1$, respectively.


5.2. Human Role Assignment 

Given a partitioning of $G$ into groups $\{G_i\}$ and their
event labels $\{e_i\}$, we use the second MCMC process within
every $G_i$ to assign human roles and object labels to
trajectories. Each trajectory $\Gamma_{ij}$ in $G_i$ is randomly assigned
an initial human-role/object label $r_{ij}$ for solution $pg_i$. In
each iteration, we randomly select $\Gamma_{ij}$ and change its role
label to generate a new proposal $pg'_i$. The acceptance rate
is $\alpha = \min\Big\{1, \frac{Q(pg'_i \to pg_i)\, p(pg'_i \mid G_i)}{Q(pg_i \to pg'_i)\, p(pg_i \mid G_i)}\Big\}$,
where $\frac{Q(pg'_i \to pg_i)}{Q(pg_i \to pg'_i)} = 1$ and $p(pg'_i \mid G_i)$ is
maximized by dynamic programming, specified in Sec. 5.3.


5.3. Detection of Latent Sub-events with DP 


From steps (1) and (2), we have obtained the trajectory
groups $\{G_i\}$, and their event labels $\{e_i\}$ and role labels $\{r_{ij}\}$.
Every $G_i$ can be viewed as occupying the time interval
$\tau_i = [t_{i,0}, t_{i,T}]$. The results of steps (1) and (2) are jointly used
with detections of large objects $\{S_i\}$ to estimate all unary,
pairwise, and n-ary relations $\psi_i$ of every $G_i$. Then, we
apply dynamic programming for every $G_i$ in order to find
latent templates $L(\tau_{i,u}) \in \Delta_L$ and their optimal durations
$\tau_{i,u} \subset \tau_i$. In the sequel, we drop the index $i$ of the
group, for simplicity.

The optimal assignment of sub-events can be formulated
using a graph, shown in Fig. 5. To this end, we partition
$[t_0, t_T]$ into equal-length time intervals $\{[t_{k-1}, t_k]\}$, where
$t_k - t_{k-1} = \delta t = 2$ sec. Nodes $L_a^k$ in the graph
represent the assignment of templates $L_a \in \Delta_L$ to the intervals
$[t_{k-1}, t_k]$. The graph also has the source and sink nodes.


Directed edges in the graph are established only between
nodes $L_{a'}^{k'}$ and $L_a^k$, $1 \le k' < k$, to denote a possible
assignment of the very same template $L_a$ to the temporal sequence
$[t_{k'}, t_k]$. The directed edges are assigned weights (a.k.a.
belief messages), $m(L_{a'}^{k'}, L_a^k)$, defined as

$$m(L_{a'}^{k'}, L_a^k) = \log p(L_a(t_{k'}, t_k) \mid G(t_{k'}, t_k)), \tag{5}$$

where $\log p(L_a(t_{k'}, t_k) \mid G(t_{k'}, t_k))$ is given by (2).
Consequently, the belief of node $L_a^k$ is defined as

$$b(L_a^k) = \max_{k', a'} \big[ b(L_{a'}^{k'}) + m(L_{a'}^{k'}, L_a^k) \big]. \quad \text{[Forward pass]} \tag{6}$$

Here the belief is initialized to 0 at the source node. We compute
the optimal assignment of latent sub-events using the above graph
in two passes. In the forward pass, we compute the beliefs of all
nodes in the graph using (6). Then, in the backward pass, we
backtrace the optimal path between the sink and source nodes, in the
following steps:

0: Let $t_k \leftarrow t_T$;

1: Find the optimal sub-event assignment at time $t_k$ as
$L_{a^*}^k = \arg\max_a b(L_a^k)$; let $a \leftarrow a^*$;

2: Find the best time moment in the past $t_{k^*}$, $k^* < k$,
and its best sub-event assignment as
$L_{a^*}^{k^*} = \arg\max_{a', k'} \big[ b(L_{a'}^{k'}) + m(L_{a'}^{k'}, L_a^k) \big]$;
let $a \leftarrow a^*$ and $k \leftarrow k^*$;

3: If $t_k > t_0$, go to Step 2.
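
The forward and backward passes can be sketched in Python as follows. This is an illustrative segmental dynamic program rather than our exact implementation: the callback `score` stands in for the log-posterior of assigning one template to a run of unit intervals, the callback `allowed` stands in for the ST-AOG transition rules, and the toy numbers in the usage example are invented.

```python
def segment_sub_events(num_units, templates, score, allowed=lambda prev, cur: True):
    """Assign templates to consecutive runs of unit intervals by DP.

    score(a, k0, k1): log-posterior of template a covering units k0..k1-1.
    allowed(prev, cur): whether cur may follow prev (prev is None at the source).
    Returns (best_total, [(template, k0, k1), ...]).
    """
    neg_inf = float("-inf")
    # belief[k][a]: best score over segmentations of units 0..k-1 ending with a.
    belief = [{a: neg_inf for a in templates} for _ in range(num_units + 1)]
    back = [{a: None for a in templates} for _ in range(num_units + 1)]

    for k in range(1, num_units + 1):            # forward pass
        for a in templates:
            for k0 in range(k):
                if k0 == 0:
                    preds = [(None, 0.0)]        # source node, belief 0
                else:
                    preds = [(p, belief[k0][p]) for p in templates]
                for prev, prev_belief in preds:
                    if prev_belief == neg_inf or not allowed(prev, a):
                        continue
                    total = prev_belief + score(a, k0, k)
                    if total > belief[k][a]:
                        belief[k][a] = total
                        back[k][a] = (k0, prev)

    a = max(templates, key=lambda t: belief[num_units][t])
    best_total = belief[num_units][a]
    if best_total == neg_inf:
        return best_total, []                    # no feasible assignment

    segments, k = [], num_units                  # backward pass
    while k > 0:
        k0, prev = back[k][a]
        segments.append((a, k0, k))
        k, a = k0, prev
    segments.reverse()
    return best_total, segments


# Toy example: per-unit log-likelihoods plus a constant per-segment penalty.
unit_ll = {
    "approach": [-1.0, -1.0, -5.0, -5.0],
    "give": [-5.0, -5.0, -1.0, -5.0],
    "leave": [-5.0, -5.0, -5.0, -1.0],
}


def toy_score(a, k0, k1):
    return sum(unit_ll[a][k0:k1]) - 0.5


total, segments = segment_sub_events(4, list(unit_ll), toy_score)
```

In the toy example, the best path assigns "approach" to the first two units, followed by "give" and "leave" on one unit each, mirroring the Exchange Box decomposition described in Sec. 2.1.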

6. Experiment 

Existing Datasets. Existing datasets on aerial videos,
group events or human roles are inappropriate for our
evaluation. Some aerial videos or images do show
group events, but the events are not annotated [3, 2, 23,
22]. Most aerial datasets are compiled for tracking
evaluation only [13, 12, 29]. Existing group-activity videos
[8, 32, 4, 18] or social role videos [41, 9, 16, 30, 14] are
captured on or near the ground surface, and have sufficiently
high resolution for robust people detection. Thus, we have
prepared and released a new aerial video dataset^1 with the
new challenges listed in Sec. 1.2.

Aerial Events Dataset. A hex-rotor with a GoPro
camera was used to shoot aerial videos at an altitude of 25
meters from the ground. The videos show two different
scenes, viewed top-down from the flying hex-rotor. The
dataset contains 27 videos, 86 minutes in total, at 60 fps and a
resolution of 1920 x 1080, with about 15 actors in each video.
All video frames are registered onto a reference plane of the video
panorama. Annotations are provided ([37]) as: bounding
boxes around groupings of people, events, human roles, and
small and large objects. The objects include: 1. Building, 2.
Vending Machine, 3. Table & Seat, 4. BBQ Oven, 5. Trash
Bin, 6. Shelter, 7. Info Booth, 8. Box, 9. Frisbee, 10. Car,
11. Desk, 12. Blanket. The events include: 1. Play Frisbee,
2. Serve Table, 3. Sell BBQ, 4. Info Consult, 5. Exchange
Box, 6. Pick Up, 7. Queue for Vending Machine, 8. Group
Tour, 9. Throw Trash, 10. Sit on Table, 11. Picnic. The
human roles include: 1. Player, 2. Waiter, 3. Customer, 4.
Chef, 5. Buyer, 6. Consultant, 7. Visitor, 8. Deliverer, 9.
Receiver, 10. Driver, 11. Queuing Person, 13. Guide, 14.
Tourist, 15. Trash Thrower, 16. Picnic Person.

^1 The dataset can be downloaded from http://www.stat.ucla.edu/~tianmin.shu/AerialVideo/AerialVideo.html

| Variant | Method | Input setting | Group | Event | Role |
|---|---|---|---|---|---|
| Baseline Var | [10] for grouping, [ ] for event and role classification | Ground-truth tracks + object annotation | 77.71% | 17.22% | 13.98% |
| Baseline | Baseline method as above | Tracking result | 39.64% | 16.94% | 5.53% |
| Ours Var1 | Our full model | Ground-truth tracks + object annotation | 95.48% | 96.38% | 89.94% |
| Ours Var2 | Our full model | Tracking result + object annotation | 87.55% | 54.75% | 28.86% |
| Ours Var3 | Our full model | Tracking result + group labeling | N/A | 39.92% | 18.71% |
| Ours Var4 | Our model without temporal event grammar | Tracking result | 40.41% | 18.51% | 8.69% |
| Ours | Our full model | Tracking result | 49.47% | 32.84% | 18.92% |

Table 1: Comparison of our method with baseline methods and variants of our approach. Our method yields the best accuracy
based on ground-truth bounding boxes and object labels compared to the baseline methods. Using noisy tracking and object
detection results, the accuracy is limited, yet better than the baseline methods under the same condition. This demonstrates the
advantages of our joint inference. When given access to the ground truth of objects or people grouping, our results improve.
Without reasoning about latent sub-events, accuracy drops significantly, which justifies our model's ability to capture the
structural variations of group events.

Evaluation Metrics. We split the 27 videos into 3 sets,
such that different event categories are evenly distributed,
and use three-fold cross validation for our evaluation.
Although our training and test videos show the same two
scenes, we make the assumption that the layout of ground
surfaces and large objects is unknown. Also, different
videos in our dataset cover different parts of these large
scenes, which are also assumed unknown. We evaluate
accuracy of: i) grouping people, ii) event recognition, iii) role
assignment. While our approach also estimates sub-events,
note that they are latent and not annotated. All results are
weighted by the lengths of trajectories in each
video. For specifying evaluation metrics we use the
following notation. $G = \{G_i\}$ and $G' = \{G'_i\}$ are the sets
of groups in ground truth and inference results, respectively.
$\Gamma_{ij}$ is the $j$th trajectory in the $i$th group in ground-truth data,
with duration $|\Gamma_{ij}|$, group label $g_{ij}$, event type $e_{ij}$ and
human role $r_{ij}$ in ground truth. $\Gamma'_{ij}$, with $g'_{ij}$, $e'_{ij}$ and
$r'_{ij}$, is defined analogously for our inference.
For group $G_i$, we call the best matched (i.e., most overlapped)
group in $G'$ $M_i$. For group $G'_i$, we call the best matched
group in $G$ $M'_i$. Then, precision and recall of grouping
are

$$Pr_g = \sum_{G_i \in G} \Big( \sum_{\Gamma_{ij} \in G_i} \mathbb{1}(M_i = g'_{ij}) \cdot |\Gamma_{ij}| \Big) \Big/ \sum_{G_i \in G} \sum_{\Gamma_{ij} \in G_i} |\Gamma_{ij}|, \tag{7}$$

$$Rc_g = \sum_{G'_i \in G'} \Big( \sum_{\Gamma'_{ij} \in G'_i} \mathbb{1}(M'_i = g_{ij}) \cdot |\Gamma'_{ij}| \Big) \Big/ \sum_{G'_i \in G'} \sum_{\Gamma'_{ij} \in G'_i} |\Gamma'_{ij}|. \tag{8}$$

Accuracy of grouping is $F_g = 2 / (1/Pr_g + 1/Rc_g)$.
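
The duration weighting and the harmonic mean used for the grouping accuracy can be illustrated with a short Python sketch. The helper names and the toy durations are assumptions for illustration; matching groups between ground truth and inference is taken as already done, so each trajectory is reduced to a (matched, duration) pair.

```python
def weighted_fraction(matches):
    """Duration-weighted fraction of correctly matched trajectories.

    matches: list of (is_correct, duration) pairs, one per trajectory.
    """
    total = sum(duration for _, duration in matches)
    hit = sum(duration for ok, duration in matches if ok)
    return hit / total if total > 0 else 0.0


def grouping_f_measure(precision, recall):
    """Harmonic mean 2 / (1/Pr + 1/Rc); taken to be 0 when either term is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)


# Toy example: durations in seconds.
precision = weighted_fraction([(True, 30.0), (False, 10.0)])   # 0.75
recall = weighted_fraction([(True, 20.0), (True, 20.0)])       # 1.0
f_g = grouping_f_measure(precision, recall)
```

Long trajectories therefore contribute more to the scores than short, fragmented ones.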

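Event recognition accuracy $F_e$ and role assignment
accuracy $F_r$ are defined as

$$F_e = \sum_{G'_i \in G'} \Big( \sum_{\Gamma'_{ij} \in G'_i} \mathbb{1}(e'_{ij} = e_{ij}) \cdot |\Gamma'_{ij}| \Big) \Big/ \sum_{G'_i \in G'} \sum_{\Gamma'_{ij} \in G'_i} |\Gamma'_{ij}|, \tag{9}$$

$$F_r = \sum_{G'_i \in G'} \Big( \sum_{\Gamma'_{ij} \in G'_i} \mathbb{1}(r'_{ij} = r_{ij}) \cdot |\Gamma'_{ij}| \Big) \Big/ \sum_{G'_i \in G'} \sum_{\Gamma'_{ij} \in G'_i} |\Gamma'_{ij}|, \tag{10}$$

where $|\Gamma'_{ij}|$ is the duration of the $j$th trajectory in the inferred
group $G'_i$, and primed and unprimed labels denote inferred and
ground-truth event types and roles, respectively.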

Baselines. To evaluate the effectiveness of each module of
our approach, we compare with baselines and variants of
our method defined in Tab. 1. For the baselines we
extract the following low-level features on trajectories: a
shape-context-like feature [8], average velocity, aligned
orientation, and distance from each type of large object. All elements
of feature vectors are normalized to fall in [0, 1].

Results. We register raw videos by RANSAC over
Harris corner feature points, then apply the method of [12]
for tracking, which is based on background subtraction
[40, 33]. We also use the detector of [31] to detect buildings
and cars, while other static objects are inferred in scene
labeling. We do not detect portable objects, e.g., Frisbee and
Box.

We evaluate our approach on both annotated bounding
boxes and real tracking results. Example qualitative results
are presented in Fig. 6. The quantitative results are shown in
Tab. 1. With real tracking results, our full model reaches
49.47% grouping, 32.84% event recognition and 18.92% role
assignment accuracy, compared to 39.64%, 16.94% and 5.53% for
the baseline under the same input setting. Confusion matrices of
event recognition and role assignment are shown in Fig. 7.
Additional results are presented in the supplementary material.

7. Conclusion 

We collected a new aerial video dataset with detailed
annotations, which presents new challenges to computer
vision and complements existing benchmarks. We specified
a framework for joint inference of events, human roles
and people groupings using noisy input. Our experiments
showed that addressing each of these inference tasks in
isolation is very difficult in aerial videos, and thus provided
justification for our holistic framework. Our results
demonstrated significant performance improvements over
baselines when we constrained uncertainty in input features with
domain knowledge.

Our model is limited and can be extended in two
directions. First, we currently infer the function of the objects
only implicitly, based on the group events. In the future, we wish
to explicitly infer the functional map for a given site, in the
sense that a certain area corresponds to specific human
activities, e.g., dining area, parking lot, etc. Unlike
appearance-based aerial image parsing [28], the spatial segmentation
will be guided by the spatiotemporal characteristics of
human activities. Second, similar to what [39] did for the
prediction of individual intention, we would like to reason about
the intention of a group as another extension of our work.

Figure 6: Visualization of results including groups (large bounding boxes), events (text) and human roles (small bounding
boxes with text). In events with more than one role, we use the shaded bounding box to represent the second role; small
portable objects are labeled with a lighter color. From event and human role recognition, we can group people even when they
are far from each other (e.g., Play Frisbee and Sell BBQ). In the top-rightmost failure example, the true event Pick Up is wrongly
recognized as Exchange Box because one person's trajectory is inferred as Box. In the bottom-rightmost failure example, our
event recognition is correct, but the true Consultant role is wrongly inferred as the Visitor role.

Figure 7: Confusion matrices of event recognition and role assignment results. Panels: (a) event recognition on GT,
(b) event recognition on tracking result, (c) role assignment on GT. (a) is the event recognition result based on
ground-truth (GT) bounding boxes and object labels; (b) is the result based on real tracking and detections. From (a) and (b)
we can see that Info Consult, Sit on Table and Serve Table cannot be easily distinguished from each other solely based on noisy
tracklets. Some events (e.g., Group Tour) tend to be wrongly favored by our approach, especially when we do not observe
some distinguishing objects. (c) is the confusion matrix of role assignment within each event class, based on ground-truth bounding
boxes and object labels. Each 2x2 block is a confusion matrix of role assignment within that event.